-
-
Notifications
You must be signed in to change notification settings - Fork 213
feat(schema): document #355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @luoling8192, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request implements a new feature for document processing and storage. It introduces a dedicated "documents" table in the database to house scraped web content, along with its processed versions, summaries, and vector embeddings for efficient retrieval. The existing link resolver has been enhanced to automatically detect URLs in chat messages, scrape their content, and persist this information, enabling richer data analysis and search capabilities within the application.
Highlights
- Document Storage: Introduced a new "documents" table in the database to store scraped web content, summaries, and vector embeddings.
- Web Scraping Integration: The "link-resolver" now automatically scrapes content from URLs found in chat messages using r.jina.ai and stores it in the new "documents" table.
- New Dependencies: Added @mendable/firecrawl-js to facilitate web scraping, along with its associated transitive dependencies.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
✅ Deploy Preview for tgsearch ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new documents table and related functionality to scrape and store content from links found in messages. It adds a link resolver using Jina's reader service and a new dependency firecrawl-js. While this is a good feature addition, the review has identified several critical issues in the implementation of document retrieval, potential data integrity problems with foreign keys, and areas for improvement in the link resolver and schema definitions. Addressing these points will improve the correctness, robustness, and maintainability of the new feature.
| const relevantDocs = (await withDb(async db => db | ||
| .select() | ||
| .from(documentsTable) | ||
| .where(sql`${documentsTable.processed_content_jieba_tokens} @> ${content.text}`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text search query sql${documentsTable.processed_content_jieba_tokens} @> ${content.text}`` is incorrect. The content.text is a raw string, but a `jsonb` containment query (`@>`) on `processed_content_jieba_tokens` expects a JSONB array of tokens. The input text should be tokenized first (e.g., using Jieba), and then the tokens should be passed to the SQL query as a JSON array string, for example: `sql`... @> ${JSON.stringify(tokens)}::jsonb`.
| const relevantDocs = (await withDb(async db => db | ||
| .select() | ||
| .from(documentsTable) | ||
| .where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The vector similarity operator <=> is being used in a WHERE clause. This operator calculates the distance and is typically used in an ORDER BY clause to sort results by similarity. Using it in a WHERE clause without a comparison is incorrect. You should use orderBy to sort by distance.
| .where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`) | |
| .orderBy(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`) |
| const documents = [] | ||
|
|
||
| if (content.text) { | ||
| const relevantDocs = (await withDb(async db => db | ||
| .select() | ||
| .from(documentsTable) | ||
| .where(sql`${documentsTable.processed_content_jieba_tokens} @> ${content.text}`) | ||
| .limit(pagination?.limit || 10) | ||
| .offset(pagination?.offset || 0), | ||
| )).unwrap() | ||
| logger.withFields({ relevantDocs: relevantDocs.length }).verbose('Retrieved jieba documents') | ||
| documents.push(...relevantDocs) | ||
| } | ||
|
|
||
| if (content.embedding && content.embedding.length > 0) { | ||
| const relevantDocs = (await withDb(async db => db | ||
| .select() | ||
| .from(documentsTable) | ||
| .where(sql`${documentsTable.summary_vector_1536} <=> ${content.embedding}`) | ||
| .limit(pagination?.limit || 10) | ||
| .offset(pagination?.offset || 0), | ||
| )).unwrap() | ||
| logger.withFields({ relevantDocs: relevantDocs.length }).verbose('Retrieved vector documents') | ||
| documents.push(...relevantDocs) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retrieveDocuments function performs two separate database queries for text and vector search, and then concatenates the results. This approach has several issues:
- Duplicates: It can return duplicate documents if a document matches both text and vector queries.
- Incorrect Pagination: The
limitandoffsetare applied to each query separately, not to the combined result set. This means you could get up to2 * limitresults, and pagination will not work as expected. - Inconsistent Ranking: The results from the two queries are not ranked against each other; they are simply appended. This can lead to less relevant results appearing before more relevant ones.
A better approach would be to combine these into a single query if possible, or fetch results and then merge, de-duplicate, and re-rank them in the application before applying pagination.
|
|
||
| export const documentsTable = pgTable('documents', { | ||
| id: uuid().primaryKey().defaultRandom(), | ||
| chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The foreign key on chat_messages_id defaults to ON DELETE NO ACTION. This will cause an error when trying to delete a chat message that has associated documents, which can lead to data integrity issues or unexpected application behavior. If documents should be deleted along with their corresponding message, consider using onDelete: 'cascade'.
| chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id), | |
| chat_messages_id: uuid().notNull().references(() => chatMessagesTable.id, { onDelete: 'cascade' }), |
| return Ok([] as CoreMessage[]) | ||
| for (const message of opts.messages) { | ||
| if (message.content) { | ||
| const links = message.content.match(/https?:\/\/\S+/g) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (links && links.length > 0) { | ||
| for (const link of links) { | ||
| try { | ||
| const response = await fetch(`https://r.jina.ai/${link}`, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| await recordDocuments([{ | ||
| chatMessagesId: message.uuid, | ||
| rawContent: doc, | ||
| processedContent: doc, | ||
| processedContentJiebaTokens: [], | ||
| summary: doc, | ||
| }]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scraped document is used for rawContent, processedContent, and summary without any processing. Also, processedContentJiebaTokens is an empty array. For these fields to be effective for search and retrieval, the content should be processed (e.g., cleaned, tokenized) and a summary should be generated. This implementation seems like a placeholder and should be enhanced for the feature to be fully functional.
| }, table => [ | ||
| index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')), | ||
| index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')), | ||
| index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')), | ||
| index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')), | ||
| ]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recordDocuments function in packages/core/src/models/documents.ts uses onConflictDoUpdate with chat_messages_id as the target. This implies that chat_messages_id should be unique. To enforce this at the database level and for clarity, you should add a unique index on this column.
| }, table => [ | |
| index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')), | |
| index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')), | |
| index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')), | |
| index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')), | |
| ]) | |
| }, table => [ | |
| uniqueIndex('documents_chat_messages_id_unique_index').on(table.chat_messages_id), | |
| index('documents_summary_vector_1536_index').using('hnsw', table.summary_vector_1536.op('vector_cosine_ops')), | |
| index('documents_summary_vector_1024_index').using('hnsw', table.summary_vector_1024.op('vector_cosine_ops')), | |
| index('documents_summary_vector_768_index').using('hnsw', table.summary_vector_768.op('vector_cosine_ops')), | |
| index('documents_processed_content_jieba_tokens_index').using('gin', table.processed_content_jieba_tokens.op('jsonb_path_ops')), | |
| ]) |
No description provided.